John Rauser: Statistics Without the Agonizing Pain
Teaching statistics by way of mathematics is like teaching philosophy by way of ancient Greek
(Paraphrased from Wallis and Roberts (1956)).
Cobb (2007)
From Mosquito, beer dataset.
Mean of beer values is: 23.60
Mean of water values is: 19.17
Mean difference is: 4.43
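As a minimal sketch, the three numbers above can be reproduced with NumPy. The values are the 43 pooled observations listed later in this talk. Treating the first 25 as `beer` and the remaining 18 as `water` is an assumption, chosen to match the 25-value split used in the permutation code.

```python
import numpy as np

# Assumption: the first 25 pooled values are the beer group and the
# remaining 18 are the water group.
beer = np.array([14, 33, 27, 11, 12, 27, 26, 25, 27, 27, 22, 36, 37,
                 3, 23, 7, 25, 17, 36, 31, 30, 22, 20, 29, 23])
water = np.array([33, 23, 23, 13, 24, 8, 4, 21, 24, 21, 26, 27, 22,
                  21, 25, 20, 7, 3])

beer_mean = np.mean(beer)
water_mean = np.mean(water)
# The observed mean difference (beer minus water).
observed_diff = beer_mean - water_mean

print(f"Mean of beer values is: {beer_mean:.2f}")
print(f"Mean of water values is: {water_mean:.2f}")
print(f"Mean difference is: {observed_diff:.2f}")
```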
Or the null universe.
Or the null world.
Null means “not any”.
Define a world in which the difference of interest is set to zero.
There is not any difference between the population from which
beer has been drawn and the population from which water has been drawn.
If we draw a very large number of similar samples for beer and water, then the average mean difference will approach 0.
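A rough illustration of this claim, assuming a made-up null world: both groups are drawn from the same normal population. The population mean (22), standard deviation (8), and seed are hypothetical choices for the sketch, not values from the dataset.

```python
import numpy as np

rng = np.random.default_rng(2024)  # Seed chosen arbitrarily.

# Hypothetical null world: both groups come from the same population.
# The mean (22) and standard deviation (8) are made-up values.
n_samples = 10000
diffs = np.zeros(n_samples)
for i in range(n_samples):
    null_beer = rng.normal(22, 8, size=25)
    null_water = rng.normal(22, 8, size=18)
    diffs[i] = np.mean(null_beer) - np.mean(null_water)

# Individual differences vary, but their average is close to 0.
print(np.mean(diffs))
```

Any single mean difference can be several units away from 0; it is the average over many samples that settles near 0.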
Can the observed value plausibly be interpreted as a mean difference between samples from the null world?
Edvard Munch (1893) “The Scream”, photo by Richard Mortel, licensed with CC-By.
We would like to draw a very large number of beer and water samples from the null world.
For each null-world sample, we calculate the mean difference.
These mean-difference values form the sampling distribution under the null hypothesis.
But how do we get so many samples?
Variables are names for values:
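For instance (the names `a` and `b` are illustrative only):

```python
# The name `a` now refers to the value 10.
a = 10
# Using the name gives back the value, so `b` refers to 20.
b = a * 2
print(b)
```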
Arrays are values that are containers for other values. They can contain many values.
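A small sketch using NumPy arrays; the name `some_numbers` and its contents are made up for illustration.

```python
import numpy as np

# One array value containing five numbers.
some_numbers = np.array([3, 1, 4, 1, 5])

# The array knows how many values it contains.
print(len(some_numbers))
# Functions such as np.mean work on all the values at once.
print(np.mean(some_numbers))
```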
We can stick arrays (containers) together with the concatenate function:
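A brief example, with two made-up arrays `first` and `second`:

```python
import numpy as np

first = np.array([1, 2, 3])
second = np.array([10, 20])

# np.concatenate takes a list of arrays and joins them end to end.
both = np.concatenate([first, second])
print(both)
```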
We can select the first (e.g.) 4 elements with indexing:
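A short sketch with a made-up array `values`. The slice `[:4]` means "from the start, up to but not including position 4", and `[4:]` means "from position 4 to the end".

```python
import numpy as np

values = np.array([10, 20, 30, 40, 50, 60])

# The first 4 elements.
first_four = values[:4]
# Everything after the first 4 elements.
the_rest = values[4:]
print(first_four)
print(the_rest)
```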
The computer provides routines to generate randomness:
Among many other things, we can randomly shuffle (permute) the values in arrays.
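A minimal sketch, assuming NumPy's `default_rng` random number generator and a made-up array `values`. The `permuted` method returns a shuffled copy and leaves the original array unchanged.

```python
import numpy as np

# Make a random number generator.
rng = np.random.default_rng()

values = np.array([1, 2, 3, 4, 5, 6])
# A new array with the same values in a random order.
shuffled = rng.permuted(values)
print(shuffled)
```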
The permutation idea:
Pool the beer and water values into one large array.

array([14, 33, 27, 11, 12, 27, 26, 25, 27, 27, 22, 36, 37,  3, 23,  7, 25,
       17, 36, 31, 30, 22, 20, 29, 23, 33, 23, 23, 13, 24,  8,  4, 21, 24,
       21, 26, 27, 22, 21, 25, 20,  7,  3])
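import numpy as np

# Random number generator.
rng = np.random.default_rng()

# `population` is the pooled array of beer and water values.
# Array to store the 10000 fake mean differences.
z = np.zeros(10000)
# Repeat procedure 10000 times.
for i in range(10000):
    # The trial procedure above.
    shuffled = rng.permuted(population)
    # First 25 values stand in for beer, the rest for water.
    fake_beers = shuffled[:25]
    fake_waters = shuffled[25:]
    fbm = np.mean(fake_beers)
    fwm = np.mean(fake_waters)
    fm_diff = fbm - fwm
    # Store the result.
    z[i] = fm_diff
# Show the first 10 values.
z[:10]

array([-3.02      ,  1.94888889,  0.13333333,  4.33777778, -1.39555556,
        1.37555556,  3.95555556, -2.54222222,  0.03777778, -1.49111111])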
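One way to use the simulated values, sketched here under stated assumptions, is to ask what proportion of the null-world mean differences are at least as large as the observed one. The split of the pooled values into `beer` (first 25) and `water` (remaining 18), the seed, and the name `p_null` are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)  # Seed chosen arbitrarily.

# Assumption: first 25 pooled values are beer, remaining 18 are water.
beer = np.array([14, 33, 27, 11, 12, 27, 26, 25, 27, 27, 22, 36, 37,
                 3, 23, 7, 25, 17, 36, 31, 30, 22, 20, 29, 23])
water = np.array([33, 23, 23, 13, 24, 8, 4, 21, 24, 21, 26, 27, 22,
                  21, 25, 20, 7, 3])
population = np.concatenate([beer, water])
observed_diff = np.mean(beer) - np.mean(water)

# Build the sampling distribution under the null by shuffling.
z = np.zeros(10000)
for i in range(10000):
    shuffled = rng.permuted(population)
    z[i] = np.mean(shuffled[:25]) - np.mean(shuffled[25:])

# Proportion of null-world differences at least as large as observed.
p_null = np.count_nonzero(z >= observed_diff) / len(z)
print(p_null)
```

A small proportion suggests the observed difference would be unusual in the null world. The exact value changes from run to run unless the seed is fixed.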
TtestResult(statistic=np.float64(1.6402506050018828), pvalue=np.float64(0.054302080886695414), df=np.float64(41.0))
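The result above has the form reported by an independent two-sample t-test; SciPy's `scipy.stats.ttest_ind` with `alternative='greater'` is one function that returns a result of this kind, though the exact call is not shown here. As a hedged cross-check, the statistic and degrees of freedom can be sketched by hand with NumPy, assuming equal variances in the two groups and the same `beer`/`water` split as in the permutation code (first 25 pooled values, then the remaining 18).

```python
import numpy as np

# Assumption: first 25 pooled values are beer, remaining 18 are water.
beer = np.array([14, 33, 27, 11, 12, 27, 26, 25, 27, 27, 22, 36, 37,
                 3, 23, 7, 25, 17, 36, 31, 30, 22, 20, 29, 23])
water = np.array([33, 23, 23, 13, 24, 8, 4, 21, 24, 21, 26, 27, 22,
                  21, 25, 20, 7, 3])

n_b, n_w = len(beer), len(water)
# Degrees of freedom for the pooled-variance t-test.
df = n_b + n_w - 2
# Pooled estimate of the common variance (ddof=1 gives sample variance).
pooled_var = ((n_b - 1) * np.var(beer, ddof=1) +
              (n_w - 1) * np.var(water, ddof=1)) / df
# Standard error of the difference between the two means.
se = np.sqrt(pooled_var * (1 / n_b + 1 / n_w))
t_stat = (np.mean(beer) - np.mean(water)) / se
print(t_stat, df)
```

The p-value then comes from comparing `t_stat` to a t distribution with `df` degrees of freedom, a step this sketch leaves out.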
All material for this talk at https://github.com/matthew-brett/statistics-without.